02/07/2021
Binary files:
Example:
>ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGT
CCCAAGCTTAGTTTGGGACATACTTATGCTAAAAAACATTATTGGTTGTTTATCTGAGAT
TCAGAATTAAGCATTTTA
The Sequence ID must be unique, and should not contain spaces.
Drosophila melanogaster rRNA:
>gi|174298|gb|M25016.1|DRORR5SEM_D.melanogaster_5S_rRNA
GCCAACGACCATACCACGCTGAATACATCGGTTCTCGTCCGATCACCGAAATTAAGC
AGCGTCGCGGGCGGTTAGTACTTAGATGGGGGACCGCTTGGGAACACCGCGTGTTGT
TGGCCT
(NCBI 2021)
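The FASTA rules above (a header line starting with ">", a unique ID without spaces, then sequence lines) can be checked with a few lines of Python. This is a minimal sketch, not a full parser: the function names parse_fasta and duplicated_ids are made up for illustration, and it assumes the whole file fits in memory.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (seq_id, description, sequence) tuples."""
    records = []
    seq_id, description, chunks = None, "", []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, description, "".join(chunks)))
            # The ID runs up to the first space; anything after it is description.
            seq_id, _, description = line[1:].partition(" ")
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, description, "".join(chunks)))
    return records


def duplicated_ids(records):
    """Return the set of sequence IDs that occur more than once."""
    seen, duplicates = set(), set()
    for seq_id, _, _ in records:
        if seq_id in seen:
            duplicates.add(seq_id)
        seen.add(seq_id)
    return duplicates


fasta_text = """>gi|174298|gb|M25016.1|DRORR5SEM_D.melanogaster_5S_rRNA
GCCAACGACCATACCACGCTGAATACATCGGTTCTCGTCCGATCACCGAAATTAAGC
AGCGTCGCGGGCGGTTAGTACTTAGATGGGGGACCGCTTGGGAACACCGCGTGTTGT
TGGCCT
"""
records = parse_fasta(fasta_text)
seq_id, description, sequence = records[0]
print(seq_id, len(sequence), duplicated_ids(records))
```

For real data an established library such as Biopython would usually be a better choice; the sketch is only meant to make the format rules concrete.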
(“NGS Sequencing Technology and File Formats,” n.d.; Peter J. A. Cock 2010)
Each sequence in a FASTQ file contains 4 lines:
LINE 1: @Sequence_ID:optional description of sequencing run
LINE 2: Raw sequence letters (A,C,T,G,N)
LINE 3: + (a separator)
LINE 4: Quality scores of sequence
Example:
@NB501623:178:HJLC2BGX5:1:11101:5397:1056 1:N:0:AGTCAA
CGGTCNGTGAAGAGTCGAACGTGCTCTGCNGNAGATCGGAAGAGCACACNTCTGANCTCNAGTCACANTNANATNT
+
AAAAA#EEEEEEEEE<E/EEEEEEEE/EE#/#EE/EEEE/EE<AE/EEE#EEAE/#EAE#EEAE/EA#E#E#EE#/
Quality scores (Phred+33, i.e. ASCII code minus 33): A = 32, E = 36, # = 2, < = 27
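To make the four-line layout and the quality encoding concrete, here is a small sketch that splits FASTQ text into records and decodes the quality string. It assumes Phred+33 encoding (ASCII code minus 33), which is what the Sanger format and recent Illumina output use; the function names and the toy record are illustrative only.

```python
def phred33_scores(quality):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def parse_fastq(text):
    """Return a list of (header, sequence, scores) tuples, 4 lines per record."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text should contain 4 lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, phred33_scores(quality)))
    return records


# Toy record (not real data). '#' is ASCII 35, so its score is 35 - 33 = 2.
fastq_text = "@read1 toy record\nACGTN\n+\nAE<E#\n"
reads = parse_fastq(fastq_text)
print(reads[0])
print(error_probability(30))  # a score of 30 means about 1 error in 1000 calls
```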
NAME LENGTH OFFSET LINEBASES LINEWIDTH QUALOFFSET
Where:
Example: GRCh38.primary_assembly.genome.fa.fai
chr1 248956422 8 60 61
chr2 242193529 253105712 60 61
chr3 198295559 499335808 60 61
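The index columns are what make random access into a FASTA file possible. The sketch below shows the arithmetic, reading the columns as documented for samtools faidx: LENGTH is the number of bases, OFFSET is the byte position of the first base, LINEBASES is bases per line, and LINEWIDTH is bytes per line including the newline. The class and function names are hypothetical, and QUALOFFSET (used only for FASTQ indexes) is ignored.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int      # total number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    linebases: int   # bases per line
    linewidth: int   # bytes per line, including the newline


def parse_fai_line(line):
    """Parse one .fai line (whitespace-separated) into a FaiEntry."""
    fields = line.split()
    return FaiEntry(fields[0], *(int(value) for value in fields[1:5]))


def byte_offset(entry, pos0):
    """Byte position in the FASTA file of the base at 0-based position pos0."""
    if not 0 <= pos0 < entry.length:
        raise IndexError(f"position {pos0} is outside {entry.name}")
    full_lines, within_line = divmod(pos0, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line


chr1 = parse_fai_line("chr1 248956422 8 60 61")
print(byte_offset(chr1, 0))   # first base sits right after the header line
print(byte_offset(chr1, 60))  # first base of the second sequence line
```

A tool can then seek straight to that byte and read the requested region without scanning the whole genome file.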
Meme credit: @BioMickWatson (Twitter)
seqid source type start end score strand phase attributes
Where:
Example:
##gff-version 3.1.26
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
The first line is a directive (it starts with ##) that declares the GFF version.
(Stein 2020)
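As a rough illustration of how the nine GFF3 columns and the attributes field fit together, the sketch below parses the example records. It assumes tab-separated columns, as the GFF3 specification requires, so the example is rebuilt here with explicit tabs. The names GFF3_COLUMNS and parse_gff3 are invented for this sketch, and escaping rules and multi-valued attributes are not handled.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3(text):
    """Return (directives, features) from GFF3 text."""
    directives, features = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            directives.append(line[2:])   # e.g. "gff-version 3.1.26"
        elif line.startswith("#"):
            continue                      # ordinary comment
        else:
            feature = dict(zip(GFF3_COLUMNS, line.split("\t")))
            feature["start"] = int(feature["start"])
            feature["end"] = int(feature["end"])
            # Attributes are semicolon-separated tag=value pairs.
            feature["attributes"] = dict(
                pair.split("=", 1)
                for pair in feature["attributes"].split(";") if pair
            )
            features.append(feature)
    return directives, features


gff_text = "\n".join([
    "##gff-version 3.1.26",
    "##sequence-region ctg123 1 1497228",
    "\t".join(["ctg123", ".", "gene", "1000", "9000", ".", "+", ".",
               "ID=gene00001;Name=EDEN"]),
    "\t".join(["ctg123", ".", "TF_binding_site", "1000", "1012", ".", "+", ".",
               "ID=tfbs00001;Parent=gene00001"]),
    "\t".join(["ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".",
               "ID=mRNA00001;Parent=gene00001;Name=EDEN.1"]),
])
directives, features = parse_gff3(gff_text)
for feature in features:
    print(feature["type"], feature["start"], feature["end"],
          feature["attributes"].get("Parent"))
```

The Parent attribute is what links the mRNA and the binding site back to the gene record.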
| Seq | 1-based | 0-based |
|---|---|---|
| ATG | chr1:1-3 | chr1:0-3 |
| C | chr1:7-7 | chr1:6-7 |
(Griffith 2013)
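The two rows of the table follow one rule: going from 1-based, fully closed coordinates to 0-based, half-open coordinates subtracts 1 from the start and leaves the end unchanged. A minimal sketch, with illustrative function names:

```python
def one_to_zero_based(start, end):
    """1-based inclusive interval -> 0-based half-open interval."""
    return start - 1, end


def zero_to_one_based(start, end):
    """0-based half-open interval -> 1-based inclusive interval."""
    return start + 1, end


def length_zero_based(start, end):
    """In 0-based half-open coordinates the length is simply end - start."""
    return end - start


print(one_to_zero_based(1, 3))  # ATG at chr1:1-3
print(one_to_zero_based(7, 7))  # C at chr1:7-7
```

The half-open convention is why the length of a 0-based interval needs no "+ 1" correction.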
Example:
@HD VN:1.0 SO:unsorted
@SQ SN:gi|158246|gb|M21017.1|DRORGAB LN:12026
@SQ SN:gi|174298|gb|M25016.1|DRORR5SEM LN:120
@PG ID:bowtie2 PN:bowtie2 VN: CL:"bowtie2-align-s --wrapper basic-0 --threads 6 --trim3 1 -k 1 -x bowtie_indexes/rRNA_fly --passthrough -U Quality_filter_outputs/SRR1548656.qualfilt_output.fastq"
@HD: File level metadata
@SQ: Reference sequence metadata
@RG: Read group (multiple lines allowed)
@PG: Program
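The header record types can be pulled apart with very little code, since each header line is a record type followed by tab-separated TAG:VALUE fields. The sketch below is illustrative only: parse_sam_header is a made-up name, the example header is rebuilt with explicit tabs, and the @PG line is left out to keep it short.

```python
def parse_sam_header(text):
    """Return a list of (record_type, fields) for the header lines of SAM text."""
    records = []
    for line in text.splitlines():
        if not line.startswith("@"):
            break  # header lines come first; alignment records follow
        record_type, *fields = line.split("\t")
        if record_type == "@CO":
            # Comment lines are free text rather than TAG:VALUE pairs.
            records.append((record_type, {"comment": "\t".join(fields)}))
            continue
        # Split on the first colon only, because values may contain colons.
        records.append((record_type, dict(f.split(":", 1) for f in fields)))
    return records


header_text = "\n".join([
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:gi|158246|gb|M21017.1|DRORGAB\tLN:12026",
    "@SQ\tSN:gi|174298|gb|M25016.1|DRORR5SEM\tLN:120",
])
header = parse_sam_header(header_text)
reference_lengths = {
    fields["SN"]: int(fields["LN"])
    for record_type, fields in header if record_type == "@SQ"
}
print(reference_lengths)
```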
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
Where:
Example: position=2 (counted from 0 in this toy reference; the SAM POS field itself is 1-based), CIGAR=3M2I3M
AAGTC  TAGAA (ref)
  GTCGATAG   (query)
Starting from position 2: 3 matches, 2 inserted bases, 3 matches.
(Fan 2017)
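A CIGAR string is a sequence of length and operation pairs, so it can be split with a regular expression. The sketch below is one way to do it: it parses the string and counts how many query and reference bases the alignment covers, using the operation letters from the SAM specification. The function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("3M2I3M"))
print(query_length("3M2I3M"), reference_length("3M2I3M"))
```

For 3M2I3M the read is 8 bases long (GTCGATAG) but covers only 6 reference bases, because the 2 inserted bases have no counterpart in the reference.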
Where:
(The SAM/BAM Format Specification Working Group 2021)
Format: TAG:TYPE:VALUE
Examples:
(The SAM/BAM Format Specification Working Group 2020)
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
SRR1548656.36 0 gi|158246|gb|M21017.1|DRORGAB 2873 255 29M * 0 0 TGCTTNGACTACATATGGTTGAGGGTTGT CCCFF#2AFHHHHJJIJJJJJIJJJHIJJ AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:5G23 YT:Z:UU
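The record above can be taken apart field by field. The sketch below maps the 11 mandatory columns to names, converts the numeric ones, and decodes the trailing TAG:TYPE:VALUE fields. It assumes tab-separated input, so the record is rebuilt with explicit tabs, and parse_sam_record is a hypothetical helper rather than part of any library.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_COLUMNS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_record(line):
    """Parse one SAM alignment line into a dict with a nested 'tags' dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for name in INT_COLUMNS:
        record[name] = int(record[name])
    tags = {}
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":      # signed integer
            value = int(value)
        elif value_type == "f":    # floating point number
            value = float(value)
        tags[tag] = value          # other types are kept as strings
    record["tags"] = tags
    return record


sam_line = "\t".join([
    "SRR1548656.36", "0", "gi|158246|gb|M21017.1|DRORGAB", "2873", "255",
    "29M", "*", "0", "0",
    "TGCTTNGACTACATATGGTTGAGGGTTGT", "CCCFF#2AFHHHHJJIJJJJJIJJJHIJJ",
    "AS:i:-1", "XN:i:0", "XM:i:1", "XO:i:0", "XG:i:0",
    "NM:i:1", "MD:Z:5G23", "YT:Z:UU",
])
record = parse_sam_record(sam_line)
is_unmapped = bool(record["FLAG"] & 0x4)   # FLAG bit 0x4: read is unmapped
is_reverse = bool(record["FLAG"] & 0x10)   # FLAG bit 0x10: reverse strand
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"]["NM"])
```

A FLAG of 0 has neither bit set, so this read is mapped to the forward strand.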
samtools view sample.sorted.bam | head -n 5
(Genome Research Limited 2021; Danecek et al. 2021)
For example:
The aligner writes an unsorted SAM file; convert it to BAM, then sort and index it for downstream steps.
chrom chromStart chromEnd name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts
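Required fields: chrom, chromStart, chromEnd
Optional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts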
(Ensembl 2021)
chrom chromStart chromEnd name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
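BED lines may carry anywhere from 3 to 12 columns, and the start coordinate is 0-based while the end is exclusive. The sketch below, with made-up helper names, parses one line of the example, computes the feature length, and converts it to 1-based inclusive coordinates as used in GFF.

```python
BED_COLUMNS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
               "thickStart", "thickEnd", "itemRgb",
               "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line):
    """Parse a BED line; only the columns actually present are returned."""
    values = line.split()
    if len(values) < 3:
        raise ValueError("BED needs at least chrom, chromStart and chromEnd")
    feature = dict(zip(BED_COLUMNS, values))
    feature["chromStart"] = int(feature["chromStart"])
    feature["chromEnd"] = int(feature["chromEnd"])
    return feature


def feature_length(feature):
    """0-based start, exclusive end: the length is end minus start."""
    return feature["chromEnd"] - feature["chromStart"]


def to_one_based(feature):
    """Return (chrom, start, end) in 1-based inclusive coordinates."""
    return feature["chrom"], feature["chromStart"] + 1, feature["chromEnd"]


pos1 = parse_bed_line(
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0"
)
print(pos1["name"], feature_length(pos1), to_one_based(pos1))
```

Because the end is exclusive, the end of Pos1 and the start of Pos2 are the same number even though the two features do not overlap.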
# Create a tarball
tar -czvf filename.tar.gz /path/to/dir
# Extract a tarball
tar -xzvf filename.tar.gz
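Where: c = create an archive, x = extract an archive, z = compress or decompress with gzip, v = verbose (list files as they are processed), f = the next argument is the archive file name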
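The same round trip can be scripted with Python's standard tarfile module, which may be handy inside a pipeline. This is a sketch under simple assumptions: the helper names and the temporary files are invented for the demonstration, and the "data" extraction filter is used only where the installed Python provides it.

```python
import tarfile
import tempfile
from pathlib import Path


def create_tarball(archive_path, source_dir):
    """Rough equivalent of: tar -czf archive_path source_dir"""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=Path(source_dir).name)


def extract_tarball(archive_path, dest_dir):
    """Rough equivalent of: tar -xzf archive_path -C dest_dir"""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            tar.extractall(str(dest_dir), filter="data")
        else:
            tar.extractall(str(dest_dir))


# Demonstration in a throwaway directory with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "results"
    source.mkdir()
    (source / "notes.txt").write_text("hello\n")
    archive = Path(tmp) / "results.tar.gz"
    create_tarball(archive, source)
    restored = Path(tmp) / "restored"
    restored.mkdir()
    extract_tarball(archive, restored)
    restored_text = (restored / "results" / "notes.txt").read_text()

print(restored_text)
```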
Example on mac:
md5 TLUK2021_talk.Rmd
MD5 (TLUK2021_talk.Rmd) = 01b9630ca402b38d9d50567fa783f92d
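When checksums need to be verified from a script rather than the command line, the standard hashlib module gives the same MD5 hex digest. A minimal sketch follows; the helper name md5_of_file and the temporary example file are made up for illustration. The file is read in chunks so that large sequencing files do not have to fit in memory.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file containing the bytes "hello".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    checksum = md5_of_file(path)

print(checksum)
```

Comparing this value with the checksum published alongside a download confirms that the file was transferred intact.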
Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. “Twelve years of SAMtools and BCFtools.” GigaScience 10 (2). https://doi.org/10.1093/gigascience/giab008.
Ensembl. 2021. “BED File Format - Definition and Supported Options.” https://www.ensembl.org/info/website/upload/bed.html.
Fan, Jean. 2017. “Cigar Strings for Dummies.” https://jef.works/blog/2017/03/28/CIGAR-strings-for-dummies/.
Genome Research Limited. 2021. “Samtools.” https://www.htslib.org/.
Griffith, Obi. 2013. “Tutorial: Cheat Sheet for One-Based Vs Zero-Based Coordinate Systems.” https://www.biostars.org/p/84686/#290319.
NCBI. 2021. “FASTA Format for Nucleotide Sequences.” https://www.ncbi.nlm.nih.gov/genbank/fastaformat/.
“NGS Sequencing Technology and File Formats.” n.d. https://learn.gencore.bio.nyu.edu/ngs-file-formats/.
Peter J. A. Cock, Naohisa Goto, Christopher J. Fields. 2010. “The Sanger FASTQ File Format for Sequences with Quality Scores, and the Solexa/Illumina FASTQ Variants.” Nucleic Acids Research 38 (6): 1767–71. https://doi.org/10.1093/nar/gkp1137.
Stein, Lincoln. 2020. “Generic Feature Format Version 3 (Gff3).” https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md.
The SAM/BAM Format Specification Working Group. 2020. “Sequence Alignment/Map Optional Fields Specification.” https://samtools.github.io/hts-specs/SAMtags.pdf.
———. 2021. “Sequence Alignment/Map Format Specification.” https://samtools.github.io/hts-specs/SAMv1.pdf.